DGS (General)
takahiro.yamamoto - 2:09 Saturday 18 May 2024 (29571)
balancing DAQ data rate of two NICs on k1dc0

Abstract

After we installed two NICs on k1dc0 for the DAQ stream (see also klog#29110), the IPC glitch rate decreased to about once per 1-2 days.
Because all glitches occurred on front-end computers connected to the primary NIC, and the data rate on the primary NIC was considerably larger than on the secondary one, I rebalanced the data rate between the two NICs.
At the current glitch rate, we probably need a couple of weeks to conclude whether the situation has improved.

Details

We had installed a secondary NIC on k1dc0 for the DAQ stream in order to spread the data load in the work of klog#29110. At that time we didn't modify the launch script of mx_stream, so the data volume was unevenly distributed between the two NICs: fifteen of the 25 front-end computers were connected to the primary NIC with a total data rate of 28.1 MB/s, and the remaining 10 front-end computers were connected to the secondary NIC with 13.6 MB/s.

After this update, the glitch rate decreased from a few to a few tens per day down to once per 1-2 days, so the dual-NIC configuration seems to have some effect in reducing IPC glitches.

The remaining glitches occurred only on the front-end computers connected to the primary NIC. As mentioned above, both the data rate and the number of front-end computers on the primary NIC were larger than on the secondary NIC, so I guessed that the data rate and/or the number of front-end computers is related to the glitches and balanced both between the two NICs.

Since the assignment of front-end computers to each NIC is done in /diskless/root/etc/init.d/mx_stream, the way the card number and endpoint number are determined in this script was changed (the original code is commented out). Now 13 and 12 front-end computers are connected to the primary and the secondary NIC, respectively. I also modified the order of front-ends in /diskless/root/etc/rtsystab in order to balance the total data rate of each NIC (the old file is kept as rtsystab.20240517). As a result, the data rate is now balanced as 21.3 MB/s on the primary and 20.4 MB/s on the secondary. The attachments list the front-end name, data rate, front-end serial number, endpoint number, and card number before and after this work.
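For reference, the new split was chosen so that the two totals come out nearly equal. A minimal sketch of that balancing idea in Python is shown below; the front-end names and rates are illustrative placeholders, not the actual contents of rtsystab, and this is not the mx_stream script itself.

# Greedy split of front-ends over two NICs so that the total data rate
# on each NIC is as even as possible (illustrative values only).
frontends = {"k1ex1": 3.2, "k1ix1": 2.9, "k1lsc0": 4.1, "k1asc0": 3.8}  # name -> MB/s (hypothetical)

nics = {"primary": [], "secondary": []}
totals = {"primary": 0.0, "secondary": 0.0}

# Assign the heaviest remaining front-end to whichever NIC is currently lighter.
for name, rate in sorted(frontends.items(), key=lambda kv: kv[1], reverse=True):
    target = min(totals, key=totals.get)
    nics[target].append(name)
    totals[target] += rate

print(nics)
print(totals)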

Because I'm not sure whether the cause of the remaining glitches is really the data rate, I don't know yet whether the situation will improve. Considering the current glitch rate, it will take a couple of weeks to draw any conclusion about the effect of this work.
Non-image files attached to this report
DGS (General)
takahiro.yamamoto - 0:42 Saturday 18 May 2024 (29558)
Installation of a new ADC card for K1IOO1 (not yet completed)
Though I tried to add a new ADC board to K1IOO1 for the f3 WFS, the work couldn't be completed, as reported in klog#29552.
I'll try again after the cables around the power breaker boxes have been cleaned up.

-----
As preparation for the next attempt, I installed a new AA chassis (S1307462) at U15 of the IOO1 rack. A power cable is already connected, but the new AA chassis is not turned on yet (to check that the AA chassis is not broken, I turned it on once and then turned it off at the end of today's work). The SCSI cable is not connected because the ADC board is not installed yet. The ADC board, internal cable, adapter board, and SCSI cable are stored in the server room in the mine.
IOO (General)
takahiro.yamamoto - 15:01 Friday 17 May 2024 (29560)
k1psl updated in relation to the removal of k1mzm

Abstract

k1psl showed IPC errors (see also Fig.1) after the removal of k1mzm reported in klog#29540.
To remove these errors, I updated and restarted k1psl.

Details

The IPC connections that had errors terminated on k1psl, as shown in Fig.2, so I removed these unnecessary Dolphin blocks from k1psl. After that, the model file was re-compiled, installed, and restarted. I also removed the unnecessary Dolphin connections from the IPC list (/opt/rtcds/kamioka/k1/chans/ipc/K1.ipc, see also Fig.3) and the k1mzm channels from the DAQ list (/opt/rtcds/kamioka/k1/target/fb/master, see also Fig.4). Finally, k1mzm was removed from the real-time model list (/diskless/root/etc/rtsystab, see also Fig.5).

With the above work, the removal of k1mzm should now be complete.

Images attached to this report
MIF (ASC)
takahiro.yamamoto - 11:33 Friday 17 May 2024 (29552)
Comment to Routed the cables and prepared for WFSf3 optics installation in MIFREFL table (29466)
Please have sensible staff redo the cabling.
The power breaker box cannot be opened because the new cables are laid over the box under considerable tension,
so I couldn't turn OFF the IO chassis in order to add a new ADC board to K1IOO1.

Although it might be possible to work around this by pulling the cables forcefully, NEVER DO SO.
Please remove the cables from inside the rack once and then re-route them properly.

Anyway, I gave up adding the ADC board today.
Images attached to this comment
DGS (General)
takahiro.yamamoto - 18:52 Thursday 16 May 2024 (29533)
instantaneous blackout around 12:29
[Aso, Takano, YamaT]

Abstract

Real-time front-ends and some servers went down due to the instantaneous blackout around 12:29.
Fortunately, the PR3 oplev light could still be found in the SAFE state, so we concluded that PR3 didn't slip even though it jumped from LOCK_ACQUISITION to TRIPPED.
After that, we restored the latest oplev set points, which hadn't been saved in SDF yet, and checked that the IMC can be locked.

Recovering Real-time front-ends and DAQs

Not affected
k1omc1, k1ix1, k1iy1, k1ex1, k1ey1, k1ex0, k1ey0, k1iy0, k1px1

Automatically rebooted and recovered
k1imc0, k1sr2, k1test0, k1mcf0, k1dc0

Automatically rebooted, but models then hung up due to Dolphin
k1ioo0
All real-time models on it were restarted.

Automatically rebooted, but the IO chassis couldn't be found
k1ioo1, k1pr2, k1pr3, k1prm, k1sr3, k1srm, k1omc0
These front-end computers were rebooted after disabling the Dolphin network connection.

Automatically rebooted, but timing was lost
k1lsc0
Timing synchronization didn't come back after restarting the real-time models and rebooting the front-end computer,
so the front-end computer and the IO chassis were turned OFF after disabling the Dolphin network connection.
After that, the front-end computer was booted up without the Dolphin cable, and the cable was plugged in at the proper timing.

Not rebooted, but models hung up due to Dolphin
k1asc0, k1als0
All real-time models on them were restarted.

daqd was restarted because k1dc0 went down
k1fw0, k1fw1, k1nds0, k1nds1, k1tw0, k1tw1

Reverting unsaved oplev set points

Oplev set points for some suspensions hadn't been saved in SDF yet (mainly from this morning's work).
When the real-time models went down, these changes were lost, so we restored them manually.
The following table shows the set points before and after reverting.

DoF       old setpoint    reverted value
IMMT2_P   11.3            24.3
IMMT2_Y   0.1             -8.0
PR2_P     -23.7           -3.1
PR2_Y     21.8            17.5
PR3_P     -41.7487        -39.8
PR3_Y     -21.0553        -21.6
BS_P      17.9994659424   -9.5
BS_Y      -13.8873491287  20.6
SR3_P     0.01            24.0
SR3_Y     -100.0          -88.5
Comments to this report:
shinji.miyoki - 7:55 Friday 17 May 2024 (29543)

Around 12:40, Uchiyama-kun called to tell me that Kimura-san had reported to him the possibility of an instantaneous power outage at KAGRA. After that, Aso-kun also called me with the same information and told me that all models in the DGS were dead. At that time I was in the KAGRA center area and didn't notice any flickering of the lights. Also, at the beginning of the ICRR Faculty meeting from 13:00, I heard that Sekiya-san of SK had gone to the SK area for urgent troubleshooting. So this power outage seems to have affected both the KAGRA and SK areas.

I checked the KAGRASAVIC-net warning system just after the phone call from Uchiyama-kun, but no report of the power outage was found. So this outage seems to have been shorter than the configured threshold.

Actually, the weather was getting worse and worse at the time, so lightning could have been a factor. As you know, the utility pole construction near the SK entrance was also ongoing.

shinji.miyoki - 12:38 Friday 17 May 2024 (29555)

The Hokuriku Electric Power company explained that a grounding fault occurred at Kosugi at 12:27 on the 16th.

MIF (General)
takahiro.yamamoto - 10:26 Wednesday 15 May 2024 (29512)
Comment to Investigation of the oscillation from the common mode servo (29503)

Memo

LIGO's situation
- The input signal of the common mode servo coming from the I/Q demodulator is picked up as a differential signal via TNC using an additional breakout board.
- The circuit design of our common mode board seems to be the same as the latest version of LIGO's.
- The Summing Node and the CARM servo seem to be connected in the same way as in our setup (the negative pin is connected to GND).
- CARM FAST and IMC IN2 are connected via 2-pin LEMO instead of BNC/TNC, but the negative pin of the 2-pin LEMO for FAST OUT is connected to GND.

Useful documents
1. Schematic view of CARM/ALS: LIGO-G1500456
2. Cabling around ISC: LIGO-D1200666, LIGO-D1900511
3. I/Q demod for differential output on TNC: LIGO-D1000181
4. Common mode servo: LIGO-D0901781
5. Modifications of CMS: LIGO Wiki
6. Summing Node: LIGO-D1200151

VAC (General)
takahiro.yamamoto - 14:41 Monday 13 May 2024 (29476)
Comment to Vacuum gauge data in the frame files (29474)
Do you mean this problem started recently, or has it existed for a long time?

Only for the vacuum DAQ (/opt/rtcds/kamioka/k1/chans/daq/K1EDCU_VAC_GAUGE.ini) is the slope parameter set to 6.1028e-05; it is set to 1.0 for all other systems. gwpy (at least v3.0.5) reads FrAdcData (not FrProcData) as "value * slope + offset" by default. So if we want to ignore the slope and offset parameters, we need to set the scaled parameter, e.g. TimeSeries.fetch(channel, gpsstart, gpsend, scaled=False).
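As a concrete example of the call above (assuming NDS access from a KAGRA workstation; the channel name and GPS times are placeholders):

# Compare gwpy's default scaling with the raw values for an FrAdcData channel.
from gwpy.timeseries import TimeSeries

chan = "K1:VAC-EXAMPLE_GAUGE_PRESSURE"   # hypothetical channel name
start, end = 1400000000, 1400000060      # placeholder GPS times

# Default behavior: gwpy applies "value * slope + offset" to FrAdcData channels.
scaled_data = TimeSeries.fetch(chan, start, end)

# With scaled=False the raw recorded values are returned, i.e. the
# slope (6.1028e-05 here) and offset are ignored.
raw_data = TimeSeries.fetch(chan, start, end, scaled=False)

print(scaled_data.max(), raw_data.max())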

I considered the possibility that the DAQ ini files had been changed accidentally during the k1boot maintenance (klog#29361), so I also checked the snapshot taken before the maintenance started. But the vacuum readout was recorded with slope=6.1028e-05 both before and after our maintenance, so I concluded the maintenance is unrelated to this issue.

Another possible explanation is a change in the version of gwpy you use. I'm not sure which environment is used for SEM, but gwpy is not installed in the system region, so SEM should be using gwpy from an Anaconda or similar environment. The Anaconda environment is hosted on k1nfs0, not on k1boot. So if you haven't updated your environment yourself in recent days, the gwpy version is not the cause of this issue either.

If you have never updated your environment, this issue may have existed for a long time.

Though I'm not sure who manages the vacuum DAQ ini file, it should be unified to slope=1, the same as other similar systems (Ondotori, Cryocon, etc.) and the real-time system.
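As a rough aid for such a check, a small sketch that lists channels whose slope is not 1 in a DAQ ini file is shown below (it assumes the usual layout of [channel] sections with key=value lines; treat the parsing details as an assumption):

# List channels in a DAQ ini file whose slope parameter differs from 1.
import re

ini_path = "/opt/rtcds/kamioka/k1/chans/daq/K1EDCU_VAC_GAUGE.ini"
channel = None
with open(ini_path) as f:
    for line in f:
        line = line.strip()
        if line.startswith("[") and line.endswith("]"):
            channel = line[1:-1]           # section header = channel name
            continue
        m = re.match(r"slope\s*=\s*(\S+)", line)
        if m and channel and float(m.group(1)) != 1.0:
            print(channel, m.group(1))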
DGS (General)
takahiro.yamamoto - 16:13 Saturday 11 May 2024 (29468)
Weekly maintenance
Several security updates were applied to the web server, the gateway server, and the control workstations.
k1ctr13, which was newly installed in the clean booth (klog#29457), was also updated.
VIS (EX)
takahiro.yamamoto - 15:29 Saturday 11 May 2024 (29469)
ETMX was tripped

ETMX tripped at 21:56 JST on May 10th, as shown in Fig.1, because the F3 GAS signal reached the watchdog threshold.

The Guardian request was switched between READY and TWR_DAMPED many times before VIS_ETMX went to TRIPPED.
As far as I remember, there is no Guardian node that requests the VIS guardians to go to TWR_DAMPED, so these requests seem to have been someone's manual activity.
Because I couldn't tell what the current situation is (unexpected behavior, or the result of someone's work), I have kept it in TRIPPED.

Images attached to this report
Comments to this report:
takafumi.ushiba - 18:05 Saturday 11 May 2024 (29470)

I checked what happened to ETMX, and it seems that the F3 GAS watchdog hit the threshold during the health-check measurement.
When the suspension tripped, the F3 GAS control was turned off and the F3 GAS excitation for the health check was ongoing (Fig.1).
Since I have no idea why the template doesn't work well now (I didn't change the template, so it should work, though...), I will check it next week.

Images attached to this comment
MIF (General)
takahiro.yamamoto - 21:35 Friday 10 May 2024 (29467)
Comment to Replacement of GrPDHY servo (29456)
MIF (General)
takahiro.yamamoto - 21:29 Friday 10 May 2024 (29460)
Mitigating a timing gap of BO operation for the CARM common mode servo

[Kamiizumi, Tomura, YamaT]

This is similar work to klog#29211.

Abstract

We added the latch function in order to switch the input gain of the CARM common mode servo without producing serious glitches.
k1alspll and the DAQ were restarted to apply the update of the real-time model.
After updating the model, the glitches at the moment of switching the gain are drastically mitigated, the same as for the IMC common mode servo.
 

Details

In order to mitigate glitches due to the timing gap between ON and OFF of the binary output, we applied the same model update to the CARM servo as was done for the IMC servo. A common mode servo block with the latch function was prepared as a library part last time, so today I just copied it into the k1alspll model and restarted k1alspll and the DAQ to apply the updates.

We also checked the glitch behavior at the OUT2 port when the gain was changed. Fig.1 and Fig.2 show the readout voltage at OUT2 without the latch function when IN1_GAIN was changed from 16dB to 15dB and from 15dB to 16dB, respectively. As in the IMC case, there is a timing gap of ~100us between ON and OFF of the binary output. In Fig.2 (15dB->16dB), multiple glitches can be seen around t=100us; they seem to come from a timing gap in turning OFF the 1, 2, 4, and 8dB stages. We couldn't find such behavior in the case of the IMC servo, so its presence appears to depend on individual differences among the BO cards.

The readout voltage with the latch function is shown in Fig.3 (16dB->15dB) and Fig.4 (15dB->16dB). According to these plots, the latch function works fine for the CARM servo as well. Let's also check the improvement in lock acquisition after CARM and the IFO come back.

Images attached to this report
DGS (General)
takahiro.yamamoto - 1:49 Thursday 09 May 2024 (29433)
K1MCF0 lost timing
The real-time models on K1MCF0 hung up due to a lost timing signal at 11:19.
It seems to have been caused by electrical glitches around the MCF0 rack (cabling, vacuum work, or other works?).
Models were recovered at 16:01.

DGS (General)
takahiro.yamamoto - 19:34 Sunday 05 May 2024 (29397)
Comment to MEDM time machine is available (29302)
The time machine code was modified as follows.
1. '.*_OUTMON' defined by cdsFilt (filterbank) blocks show past values of '.*_OUTPUT'.
2. '.*_OUTMON' defined by cdsEpicsOutput blocks show their past values.
(This modification doesn't affect MEDM screens showing current values.)

-----
A channel name of the form '.*_OUTMON' is used for one of the output monitors of a filter bank, which is NOT recorded by the DAQ. So the original (LIGO's) time machine code ignores '.*_OUTMON' channels and shows them as white squares. Unfortunately, in the KAGRA case many MEDM screens use '.*_OUTMON' instead of '.*_OUTPUT' or '.*_OUT16'. In addition, some models define channels named '.*_OUTMON' via cdsEpicsOutput blocks, and those channels ARE recorded by the DAQ. Because of this situation at KAGRA, the time machine had not been useful for MEDM screens displaying such '.*_OUTMON' channels.

Possible solutions
'.*_OUTMON' defined by cdsFilt blocks can monitor the output of a filter module even when the OUT_SW of the filter bank is off, but such a function is likely to be used by only a few experts. So almost all '.*_OUTMON' channels on current MEDM screens had better be replaced by '.*_OUTPUT'.

'.*_OUTMON' channels defined by cdsEpicsOutput had better be renamed to something else, because many tools regard '.*_OUTMON' as a non-DAQ-ed channel.

Because these better solutions would require large modifications of the real-time models and MEDM screens and take much longer, I instead modified the treatment of '.*_OUTMON' in the time machine code.
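A minimal sketch of the name mapping that this change corresponds to is shown below (illustrative only; the actual time machine code is more involved, and whether a given '_OUTMON' comes from a cdsFilt or a cdsEpicsOutput block is assumed to be known, e.g. from the model's channel lists):

def lookup_channel(name, is_filterbank_outmon):
    """Map a screen channel name to the channel actually queried from frame files."""
    if name.endswith("_OUTMON") and is_filterbank_outmon:
        # The _OUTMON of a filter bank is not recorded; show the recorded _OUTPUT instead.
        return name[: -len("_OUTMON")] + "_OUTPUT"
    # _OUTMON defined by cdsEpicsOutput is recorded, so query it directly.
    return name

print(lookup_channel("K1:VIS-ETMX_MN_DAMP_L_OUTMON", True))    # hypothetical filter bank channel
print(lookup_channel("K1:PSL-EXAMPLE_MONITOR_OUTMON", False))  # hypothetical cdsEpicsOutput channel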
DGS (General)
takahiro.yamamoto - 22:12 Friday 03 May 2024 (29390)
Digital system came back online
This is recovery work for the trouble reported in klog#29361.
The digital system finally came back online, and awgtpman now also works fine.

-----
When I copied the NFS region (/opt) in order to salvage files, some file permissions were probably copied incompletely: rsync -a run by root copies both permissions and ownership, but rsync -a run by controls copies only permissions. The file-permission errors seem to introduce delays that cause a timeout when launching awgtpman.

Now the disk copied by root is mounted at /opt, the disk copied by controls is mounted at /mnt/backup_controls, and the original (partially broken) disk is mounted at /mnt/original.
The latest safe.snap has been salvaged onto /opt. Yesterday's changes, except for SDF, exist only in /mnt/backup_controls. After checking all files in /mnt/backup_controls, that disk should be removed and one more backup disk copied by root must be created.
LAS (bKAGRA laser)
takahiro.yamamoto - 21:57 Friday 03 May 2024 (29389)
Comment to Laser down again (29388)
According to the laser power monitor (K1:LAS-POW_FIB_DC_INMON) shown in the top panel of Fig.1,
- [T1 cursor] a glitch occurred at 17:40 (maybe caused by restarting the real-time model),
- [crosshair] the data became strange at 17:42 (maybe caused by stopping the DAQ stream),
- [t=0] the power went to 0 W at 18:30.

After 17:42, the archived data is not reliable because the DAQ stream was stopped. But the PMC guardian, which reads the EPICS records directly, went to FAULT around 18:30 (see also the middle panel of Fig.1). So the interlock seems to have triggered around 18:30.

Around 18:30, I rebooted the front-end computers as follows.
- 18:25 k1als0 was rebooted.
- 18:29 k1ioo0 was rebooted.
- 18:35 k1ioo1 was rebooted.

On Tuesday, the interlock triggered before we started the DGS maintenance, so the DGS work was not related to the interlock. If the cause of the interlock triggering this time is the same as on Tuesday, today's work should likewise be unrelated to the interlock behavior. But if it's different and the interlock has an electrical connection with the PSL table, rebooting the digital system around REFL might be able to affect the interlock behavior.


BTW, the PT100 on the PSL table shows strange values on the SummaryPages, but there is no such strange behavior on ndscope via k1nds0 (see the bottom panel of Fig.1). I'm not sure which NDS server is used to read past data for the SummaryPages. Anyway, the SummaryPages data from this evening is not very reliable.

Images attached to this comment
DGS (General)
takahiro.yamamoto - 11:20 Thursday 02 May 2024 (29376)
Comment to Taking backup of the core system of DGS (29361)
The NFS region was unavailable on some DGS workstations
=> I re-mounted it on each workstation; it is now available.

awgtpman problem
=> I launched awgtpman in tpman-only (without awg) mode as a temporary solution.
Test points are now available, but excitations are unavailable.
A full-scale investigation and restoration will be done tomorrow in order to avoid conflicts with today's work.
DGS (General)
takahiro.yamamoto - 2:16 Thursday 02 May 2024 (29374)
Comment to Taking backup of the core system of DGS (29361)
I found that if awgtpman is launched in tpman-only mode or awg-only mode, it does not terminate. This means there is no problem with the awg and tp functions themselves. Because awggui requires both test point and excitation channels, I couldn't check whether excitation works well in awg-only mode. At least test points are available in tpman-only mode, and I have been able to see them with ndscope, diaggui, etc. But only one awgtpman process can run per real-time model, so we must find a way to launch awgtpman in the combined awg+tpman mode.

According to an analysis of the awgtpman launch logs, the timing of when it exits appears to be random. Since launching awgtpman shouldn't take long, it is possible that a timeout due to NFS performance (reading configuration files and the channel list) is occurring. So I plan to reboot the DAQ and real-time front-ends again after restoring the NFS settings.
DGS (General)
takahiro.yamamoto - 18:41 Wednesday 01 May 2024 (29370)
Comment to Taking backup of the core system of DGS (29361)
We continued the maintenance work today.
The DAQ servers and real-time models came back with the backup disks.
But awgtpman couldn't be launched for almost all models for some reason.
So only DQ channels are available now; TP and EXC channels are unavailable.

I have no good idea how to solve this problem yet.
I'll try to make a plan for what we should do by tomorrow...

DGS (General)
takahiro.yamamoto - 22:15 Tuesday 30 April 2024 (29361)
Taking backup of the core system of DGS
[Ikeda, Nakagaki, YamaT]

Abstract

We took backups of the system regions of the DAQ servers and k1boot.
The backup process for the data region of k1boot will be completed early tomorrow morning.
Don't make any changes in /opt/rtcds tonight.

After all backups are taken, we will reboot all DAQ servers and real-time front-ends tomorrow.

Details

Backup of the DAQ servers
We took a system backup of the DAQ servers for the first time in a year. All DAQ servers (k1dc0, k1fw0, k1fw1, k1nds0, k1nds1, k1tw0, k1tw1, and k1bcst0) came back online without any trouble after the backup. All servers now run on new HDDs to which all files were copied from the HDDs used until this morning. The old HDDs used until this morning are kept in the HDD slot of each server as the latest backup, tagged as the backup of 2024.04.30. The one-generation-old backup, taken last spring (Feb.-Apr. 2023), is also kept in the server slot. Backups two or more generations old were brought back to Mozumi.

Backup of the system region of k1boot
Though we tried to take a backup of the k1boot system disk, the HDD couldn't be copied due to a disk error. This HDD was newly prepared last November, so the disk error occurred within the last 6 months. The disk could no longer be used as the boot disk, but it could still be mounted from another system. So we mounted this broken HDD from another backup system taken last November and copied over all changes from the last 6 months (/etc/init.d/mx_stream and /diskless/root/etc/rtsystab). After that, we copied the current system files to two new HDDs. One of them is now used as the current k1boot system and the other is kept as the latest backup, tagged 2024.04.30. Because the broken disk can still be mounted as a data region, it is also kept just in case.

Backup of the data region of k1boot
The backup process for the data region also failed, and it is difficult to salvage all changes from the last 6 months (all model updates, filter updates, MEDM updates, etc.). So we are now copying files one by one with an rsync process. This will probably complete early tomorrow morning (2am or 3am? It's difficult to give an accurate estimate because the copy speed is not very stable). So please don't make any changes in the NFS region tonight; changes made tonight will vanish, and we cannot guarantee that we can salvage them. The remaining work will be done tomorrow. After all backups are taken, the DAQ servers and all real-time front-ends will be rebooted.
VIS (EX)
takahiro.yamamoto - 19:20 Monday 29 April 2024 (29351)
ETMX PAY was tripped
ETMX PAY tripped at 17:57 JST on Apr. 27th (Sat.) (see Fig.1).
It was recovered at 11:15 JST on Apr. 29th (Mon.), but tripped again at 11:46 JST the same day.

In both cases, a software watchdog detected oscillation and saturation in the sensor signals of the MN stage (see also Fig.2 and Fig.3). This is a matter for the experts to check the controls and/or sensors.

Guardian cannot escape from tripped states automatically, so someone must have recovered ETMX once before it tripped again, but I couldn't find any klog post about it. That may be a more serious problem than the trip itself...

By the way, the sound output of k1mon0 had been set to the on-board speaker rather than the monitor connected via HDMI. Because of this, the sound notifications from the VIS guardian couldn't be heard at all. This was my mistake in the sound-output settings when the system was upgraded. The issue was fixed today.
Images attached to this report
Comments to this report:
takafumi.ushiba - 9:55 Tuesday 30 April 2024 (29356)

I checked the reason why ETMX oscillated.
Figure 1 and 2 show the signals of NB filters and MN DAMP filters, respectively.

DOF5 and MN_DAMP_L oscillated at 5.1 Hz, which should be damped by DOF5.
So it is very likely that the cause of the oscillation is the DOF5 NB filter.
This filter was optimized when ETMX was at 90 K, so it is necessary to optimize it again at the current temperature.

Images attached to this comment
DGS (General)
takahiro.yamamoto - 18:39 Tuesday 23 April 2024 (29302)
MEDM time machine is available

The MEDM time machine now works well.
It can show past values recorded in frame files on MEDM screens and can be launched from the MEDM execute menu.

After choosing a look-back time as a relative time, GPS time, local time, etc. (see also the left window in Fig.1), a new MEDM screen is launched. Channels not recorded in frame files, such as _OUTMON, are displayed as white boxes instead of recorded values, as shown in the bottom window of Fig.1.

Images attached to this report
DGS (General)
takahiro.yamamoto - 14:40 Saturday 20 April 2024 (29286)
Weekly maintenance
Upgrade of client workstations
The client workstations in the control room were upgraded to the Debian 12 system. The workstation environment at Mozumi is now unified by this work. Workstations in the mine will also be upgraded in the near future.

Security updates for gwdet, k1gate, and client workstations
gwdet and k1gate were rebooted to apply the updates. There should be no impact on users.

The client workstations in the control room were also rebooted. There are some errors in gst-launch with the new kernel and the new Xorg libraries on nouveau. Though there is no urgent problem for using GigE, we must watch the stability in long-term operation, especially for k1mon3 and k1mon4.

Update of the server certificate for gwdet
I replaced the server certificate file for gwdet because the old certificate would expire soon. The new certificate expires in May 2025. There is nothing that end users need to take care of.

Change in the network configuration of k1dc0
k1dc0 was replaced in the work of klog#29110, and some network settings were not applied properly at that time. So I set them again as follows:
ethtool -C myri0 rx-usecs 1
ethtool -C myri1 rx-usecs 1

There was no downtime due to this issue.
VAC (IYA)
takahiro.yamamoto - 17:42 Tuesday 16 April 2024 (29236)
Recovery of vacuum readout at IYA

Abstract

I was informed by Uchiyama-san that the IYA vacuum readout on the digital system had stopped.
It was caused by some trouble with the VLAN switch, and the problem was solved by using another network switch.
Vacuum DAQ for IYA came back online around 14:27 JST.
 

Details

When Uchiyama-san informed me that the IYA vacuum readout had stopped, I couldn't access RaspPi@IYA via the network. So I asked Uchiyama-san to unplug and replug the UTP cable and then the USB power cable, but RaspPi@IYA was still unreachable. Because it was difficult to continue the recovery work remotely, I went to IYA to access the RaspPi.

According to my inspection at IYA, the RaspPi system seemed to work fine and the NIC link came up properly. At first I suspected a failure of the UTP cable or of a LAN port on the switch, but there was no improvement even when I replaced the UTP cable and changed the LAN port. Next, I suspected that the VLAN setting had somehow been changed, so I changed the network configuration of the RaspPi from a static IP to DHCP, expecting that an IP address for the DGS or CATV segment would be assigned instead of one for ICRR. But the assigned IP address was in the ICRR segment.

This situation looked like a problem in a firewall-like system rather than in the network connection itself. So I shut down the RaspPi and connected it to the network switch at the BS workstation area for the DGS network and to the one at the Cryocon rack at IXC for the ICRR network. On both networks, the RaspPi became reachable, so I could localize the problem to the network switch at IYA.

Finally, I asked Oshino-san about the port assignment of the two VLAN switches at IYA, and the UTP cable for the RaspPi was moved from port#7P of the left switch to port#3 of the right switch (see also Fig.1 and Fig.2). Then the vacuum readout on the digital system came back online around 14:27 JST.

According to the DQ channel shown in Fig.3, the IYA readout seems to have stopped around 9:53 JST on Feb. 21st. Though it came back online once around 10:48 JST on Feb. 24th, it stopped again around 14:44 JST on Feb. 27th.

Images attached to this report
DGS (General)
takahiro.yamamoto - 3:17 Tuesday 16 April 2024 (29233)
System upgrade of NUC workstation
As a test of the OS upgrade on the NUC workstations, k1ctr21 in the control room was updated to Debian 12.
If you find any trouble, please let me know.

-----
I installed a new system on a spare M.2 SSD and kept the original M.2 SSD (Debian 10) as a backup. The spare disk had previously been used in k1ctr10 at IXC; because the power supply unit of that IXC workstation was broken, its M.2 SSD and RAM had been removed from the chassis. At first I simply tried to use the k1ctr10 disk (Debian 10) in the k1ctr21 chassis, as a test of using a copied disk on another workstation, but it couldn't boot up due to secure boot. I haven't tried that many things, but it may be difficult to use copied disks on the Intel NUC workstations with UEFI (the situation might be improved by a BIOS upgrade?).

Since the k1ctr10 disk is no longer needed, a new Debian 12 system was installed on it. The installation procedure for the HP Z2 workstations (JGW-T2415617) worked fine for the NUC. Though some packages related to wireless LAN were installed in addition to the package list for the HP Z2, this doesn't seem to affect the unification of the workstation environment.
IOO (IMC)
takahiro.yamamoto - 16:58 Friday 12 April 2024 (29211)
Mitigating a timing gap of BO operation for the IMC common mode servo

[Tomura, Kamiizumi, YamaT]
 

Abstract

We updated the IMC model to mitigate glitches coming from a timing gap in the BO operation, which caused locklosses related to the IMC and/or CARM controls.
Now the gain of the MC common mode servo can be changed much more smoothly, the same as in the test on the standalone system.
Because we could not output the laser today due to other work, we haven't tested this change with the IMC locked yet.
 

Details

When we first faced locklosses related to the gain change of the IMC servo, we suspected the offset difference between 16dB and 15dB as the cause of the lockloss (see also klog#26657). But tests on the standalone system, discussed in the thread of klog#29112, showed a larger offset change during the gain change than the offset difference between 16dB and 15dB. Changing the gain by BO, especially for 16dB->15dB and 8dB->7dB, makes glitches because the photocoupler delay is different between ON and OFF. The BO supported by the LIGO RCG has an asymmetric delay of up to 200us. So, when we change the gain from 16dB (0b10000) to 15dB (0b01111), the actual change is 16dB (0b10000) -> 31dB (0b11111) -> 15dB (0b01111) within 200us, as shown in Fig.1. Finally, we could mitigate this glitch by preparing a new real-time model that closes the timing gap of the BO, verified on the standalone system.
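To make the bit arithmetic explicit, here is a small sketch of why the servo momentarily sees 31dB when the OFF transition lags the ON transition (the delay values and the helper function are illustrative, not the real-time code):

# Effective gain word vs. time when the OFF transition of the binary output lags the ON transition.
def gain_word(t_us, old=0b10000, new=0b01111, on_delay_us=10.0, off_delay_us=200.0):
    """Gain in dB (each bit weights 1/2/4/8/16 dB) at time t_us after the switch command."""
    turn_on = new & ~old    # stages that must switch OFF -> ON (the 1, 2, 4, 8 dB stages)
    turn_off = old & ~new   # stages that must switch ON -> OFF (the 16 dB stage)
    word = old
    if t_us >= on_delay_us:
        word |= turn_on
    if t_us >= off_delay_us:
        word &= ~turn_off
    return word

for t in (0, 50, 150, 250):
    g = gain_word(t)
    print(f"t = {t:3d} us : {g:2d} dB ({g:#07b})")
# The output passes through 31 dB (0b11111) before settling at 15 dB; the latch makes
# the ON and OFF transitions effectively simultaneous, so this excursion disappears.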

Because the new real-time model seemed to work fine on the standalone system, we installed it in the IMC model. By mitigating the timing gap of the BO operation, the change in the offset voltage at OUT2 is also reduced, from Fig.1 to Fig.2 (Fig.1 was taken last week because I forgot to save it today...). In the new real-time model, the ON and OFF BO operations are performed almost simultaneously, so we can change the gain from 16dB to 15dB without passing through 31dB. The situation for 8dB -> 7dB is also improved, from Fig.3 to Fig.4.

The latch and an intentional delay can be controlled on the MEDM screen (see the area above the IN1GAIN controller in Fig.5). Please keep the "Use Latch" button "ON" if you want to use the latch function; when it is "OFF", the latch function is disabled. Regarding the number of samples of delay, we confirmed that it works well for N=0~2. Because of the pin assignment, the latch channel and the gain channels are output from different BO cards, so this delay is provided to compensate for the individual difference between the two BO cards if it is too large. When we install this model for the CARM common mode servo as well, we must confirm the same things.

Images attached to this report
Comments to this report:
kenta.tanaka - 19:40 Monday 15 April 2024 (29231)

After the mitigation work, we tried to switch the IMC CMS IN1GAIN between 15dB and 16dB with the IMC locked. Thanks to the mitigation, the IMC could keep lock through the switching (Fig.1). This means that we can increase the input power without any trouble.

Images attached to this comment